contact mask


KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

Xie, Weiji, Han, Jinrui, Zheng, Jiakun, Li, Huanyu, Liu, Xinzhe, Shi, Jiyuan, Zhang, Weinan, Bai, Chenjia, Li, Xuelong

arXiv.org Artificial Intelligence

Humanoid robots show promise for acquiring diverse skills by imitating human behaviors. However, existing algorithms can only track smooth, low-speed human motions, even with delicate reward and curriculum design. This paper presents a physics-based humanoid control framework that aims to master highly dynamic human behaviors, such as Kungfu and dancing, through multi-step motion processing and adaptive motion tracking. For motion processing, we design a pipeline to extract, filter, correct, and retarget motions while ensuring compliance with physical constraints to the maximum extent possible. For motion imitation, we formulate a bi-level optimization problem that dynamically adjusts the tracking-accuracy tolerance based on the current tracking error, creating an adaptive curriculum mechanism. We further construct an asymmetric actor-critic framework for policy training. In experiments, we train whole-body control policies to imitate a set of highly dynamic motions. Our method achieves significantly lower tracking errors than existing approaches and is successfully deployed on the Unitree G1 robot, demonstrating stable and expressive behaviors. The project page is https://kungfu-bot.github.io.
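
The adaptive curriculum idea, in which a tracking-accuracy tolerance tightens or loosens with the current tracking error, can be illustrated with a minimal sketch. Everything below is an illustrative assumption: the constants, the EMA smoothing, and the exponential reward shape are not taken from the paper, which formulates this as a bi-level optimization.

    import numpy as np

    class AdaptiveTolerance:
        """Hedged sketch of an adaptive tracking-accuracy tolerance.

        Tighten the tolerance when the policy tracks well; relax it
        when the smoothed tracking error grows, so the reward signal
        stays informative. All constants are illustrative.
        """

        def __init__(self, tol=0.5, tol_min=0.05, tol_max=1.0,
                     shrink=0.95, grow=1.05, ema=0.9):
            self.tol, self.tol_min, self.tol_max = tol, tol_min, tol_max
            self.shrink, self.grow, self.ema = shrink, grow, ema
            self.err = None  # running estimate of the tracking error

        def update(self, tracking_error):
            # Smooth per-step errors so single outliers don't whipsaw the curriculum.
            self.err = (tracking_error if self.err is None
                        else self.ema * self.err + (1 - self.ema) * tracking_error)
            # Demand more accuracy when comfortably inside the tolerance, else loosen.
            factor = self.shrink if self.err < self.tol else self.grow
            self.tol = float(np.clip(self.tol * factor, self.tol_min, self.tol_max))
            return self.tol

        def reward(self, tracking_error):
            # A common exponential tracking reward gated by the current tolerance.
            return float(np.exp(-tracking_error / self.tol))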


Visual-auditory Extrinsic Contact Estimation

Yi, Xili, Lee, Jayjun, Fazeli, Nima

arXiv.org Artificial Intelligence

Estimating contact locations between a grasped object and the environment is important for robust manipulation. In this paper, we present a visual-auditory method for extrinsic contact estimation, featuring a real-to-sim approach for auditory signals. Our method equips a robotic manipulator with contact microphones and speakers on its fingers, along with an externally mounted static camera providing a visual feed of the scene. As the robot manipulates objects, it detects contact events with surrounding surfaces using auditory feedback from the fingertips and visual feedback from the camera. A key feature of our approach is the transfer of auditory feedback into a simulated environment, where we learn a multimodal representation that is then applied to real-world scenes without additional training. This zero-shot transfer is accurate and robust in estimating contact location and size, as demonstrated in our simulated and real-world experiments in various cluttered environments.
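
As a rough illustration of the auditory side, the sketch below implements a classic short-time energy onset detector over a contact-microphone signal. This is a hypothetical baseline, not the paper's learned multimodal representation; the sampling rate, frame length, and threshold are illustrative choices.

    import numpy as np

    def detect_contact_events(audio, sr=44100, frame_ms=10, threshold_db=-30.0):
        """Toy contact-event detector for a fingertip contact microphone."""
        frame = int(sr * frame_ms / 1000)
        n = len(audio) // frame
        frames = audio[: n * frame].reshape(n, frame)
        # Short-time log energy per frame, relative to full scale.
        energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        active = energy_db > threshold_db
        # A contact onset is a rising edge in the active mask.
        onsets = np.flatnonzero(active[1:] & ~active[:-1]) + 1
        return onsets * frame / sr  # onset times in seconds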


H-FCBFormer Hierarchical Fully Convolutional Branch Transformer for Occlusal Contact Segmentation with Articulating Paper

Banks, Ryan, Rovira-Lastra, Bernat, Martinez-Gomis, Jordi, Chaurasia, Akhilanand, Li, Yunpeng

arXiv.org Artificial Intelligence

Occlusal contacts are the locations at which the occluding surfaces of the maxilla and the mandible posterior teeth meet. Occlusal contact detection is a vital tool for restoring lost masticatory function and is a mandatory assessment in dentistry, with particular importance in prosthodontics and restorative dentistry. The most common method for occlusal contact detection is articulating paper. However, this method can indicate significant medically false positive and medically false negative contact areas, leaving the identification of true occlusal indications to clinicians. To address this, we propose a multiclass Vision Transformer and Fully Convolutional Network ensemble semantic segmentation model with a combined hierarchical loss function, which we name the Hierarchical Fully Convolutional Branch Transformer (H-FCBFormer). We also propose a method for generating medically true positive semantic segmentation masks derived from expert-annotated articulating paper masks and gold standard masks. The proposed model outperforms the other machine learning methods evaluated at detecting medically true positive contacts, and it identifies object-wise occlusal contact areas more accurately than dentists while taking significantly less time to do so.
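
One plausible, but unconfirmed, reading of the medically-true-positive mask derivation is a pixel-wise intersection: an articulating-paper indication counts as a true contact only where the gold standard mask agrees. The sketch below encodes that assumption; the paper's actual procedure may differ.

    import numpy as np

    def medically_true_positive_mask(paper_mask, gold_mask):
        """Assumed derivation: a paper-indicated pixel counts as a medically
        true positive only if the gold standard also marks contact there."""
        return np.logical_and(paper_mask > 0, gold_mask > 0).astype(np.uint8)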


UV-Based 3D Hand-Object Reconstruction with Grasp Optimization

Yu, Ziwei, Yang, Linlin, Xie, You, Chen, Ping, Yao, Angela

arXiv.org Artificial Intelligence

This set-up lends itself well to AR/VR settings in which the hand interacts with a predefined object, perhaps with markers to facilitate object pose estimation. Such a setting is common, although the majority of previous works [31, 33, 34] consider 3D point clouds as input, whereas we handle the more difficult case of monocular RGB inputs. Additionally, previous works [12, 13, 31, 33, 34, 60] are singularly focused on reconstructing feasible hand-object interactions: they aim to produce hand meshes with minimal penetration of the 3D object, without regard for the accuracy of the 3D hand pose. We take on the additional challenge of balancing realistic hand-object interactions with accurate 3D hand poses. Representation-wise, previous hand-object 3D reconstruction works [2, 35, 61, 72] predominantly rely on the MANO model [55].
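
The trade-off described here, penalizing object penetration while still supervising joint accuracy, is often expressed as a weighted sum of two loss terms. The sketch below is a generic PyTorch illustration; the object_sdf callable, the weights, and the loss shapes are assumptions rather than the paper's grasp-optimization objective.

    import torch

    def hand_object_loss(pred_joints, gt_joints, hand_verts, object_sdf,
                         w_pose=1.0, w_pen=10.0):
        """Balance 3D pose accuracy against hand-object penetration.

        object_sdf is assumed to return the signed distance of each
        hand vertex to the object surface (negative inside the object).
        Weights are hypothetical.
        """
        # Accuracy term: mean per-joint position error against ground truth.
        pose_loss = (pred_joints - gt_joints).norm(dim=-1).mean()
        # Feasibility term: penalize vertices that lie inside the object.
        penetration_loss = torch.relu(-object_sdf(hand_verts)).mean()
        return w_pose * pose_loss + w_pen * penetration_loss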